Abstract. Solomonoff induction is held as a gold standard for learning,
but it is known to be incomputable. We quantify its incomputability
by placing various flavors of Solomonoff’s prior $M$ in the arithmetical
hierarchy. We also derive computability bounds for knowledge-seeking
agents, and give a limit-computable weakly asymptotically optimal
reinforcement learning agent.
1 Introduction

Solomonoff’s theory of learning [?], commonly called Solomonoff induction,
arguably solves the induction problem [?]: for data drawn from any computable
measure $\mu$, Solomonoff induction will converge to the correct belief about
any hypothesis [?]. Moreover, convergence is extremely fast in the sense that the
expected number of prediction errors is $E + O(\sqrt{E})$ compared to the number of
errors $E$ made by the informed predictor that knows $\mu$ [1].

In reinforcement learning an agent repeatedly takes actions and receives
observations and rewards. The goal is to maximize cumulative (discounted) reward.
Solomonoff’s ideas can be extended to reinforcement learning, leading to the
Bayesian agent AIXI [?]. However, AIXI’s trade-off between exploration and
exploitation includes insufficient exploration to get rid of the prior’s bias [?],
which is why the universal agent AIXI does not achieve asymptotic
optimality [13,15].

For extra exploration, we can resort to Orseau’s knowledge-seeking agents.
Instead of rewards, knowledge-seeking agents maximize entropy gain [14,16] or
expected information gain [17]. These agents are apt explorers, and
asymptotically they learn their environment perfectly [?].

A reinforcement learning agent is weakly asymptotically optimal if the value
of its policy converges to the optimal value in Cesàro mean [7]. Weak asymptotic
optimality stands out because it currently is the only known nontrivial objective
* The final publication is available at http://link.springer.com/ 












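$P$                              $\{(x,q) \in \mathcal{X}^* \times \mathbb{Q} \mid P(x) > q\}$      $\{(x,y,q) \in \mathcal{X}^* \times \mathcal{X}^* \times \mathbb{Q} \mid P(xy \mid x) > q\}$
$M$                              $\Sigma^0_1 \setminus \Delta^0_1$                                  $\Delta^0_2 \setminus (\Sigma^0_1 \cup \Pi^0_1)$
$M_{\mathrm{norm}}$              $\Delta^0_2 \setminus (\Sigma^0_1 \cup \Pi^0_1)$                   $\Delta^0_2 \setminus (\Sigma^0_1 \cup \Pi^0_1)$
$\overline{M}$                   $\Pi^0_2 \setminus \Delta^0_2$                                     $\Delta^0_3 \setminus (\Sigma^0_2 \cup \Pi^0_2)$
$\overline{M}_{\mathrm{norm}}$   $\Delta^0_3 \setminus (\Sigma^0_2 \cup \Pi^0_2)$                   $\Delta^0_3 \setminus (\Sigma^0_2 \cup \Pi^0_2)$

Table 1. The computability results on $M$, $M_{\mathrm{norm}}$, $\overline{M}$, and $\overline{M}_{\mathrm{norm}}$ proved in
Section 3. Lower bounds on the complexity of $\overline{M}$ and $\overline{M}_{\mathrm{norm}}$ are given only for specific universal
Turing machines.
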

notion of optimality for general reinforcement learners [?]. Lattimore defines
the agent BayesExp by grafting a knowledge-seeking component on top of AIXI
and shows that BayesExp is a weakly asymptotically optimal agent in the class
of all stochastically computable environments [?, Ch. 5].

The purpose of models such as Solomonoff induction, AIXI, and
knowledge-seeking agents is to answer the question of how to solve (reinforcement) learning
in theory. These answers are useless if they cannot be approximated in practice,
i.e., by a regular Turing machine. Therefore we posit that any ideal model must
at least be limit computable ($\Delta^0_2$).

Limit computable functions are the functions that admit an anytime
algorithm. More generally, the arithmetical hierarchy specifies different levels of
computability based on oracle machines: each level in the arithmetical hierarchy is
computed by a Turing machine which may query a halting oracle for the
respective lower level.


In previous work [10] we established that AIXI is limit computable if
restricted to $\varepsilon$-optimal policies, and placed various versions of AIXI, AINU, and
AIMU in the arithmetical hierarchy. In this paper we investigate the
(in-)computability of Solomonoff induction and knowledge-seeking. The universal prior
$M$ is lower semicomputable and hence its conditional is limit computable. But
$M$ is a semimeasure: it assigns positive probability that the observed string has
only finite length. This can be circumvented by normalizing $M$. Solomonoff’s
normalization $M_{\mathrm{norm}}$ preserves the ratio $M(x1)/M(x0)$ and is limit computable.
If we remove the contribution of programs that compute only finite strings, we
get a semimeasure $\overline{M}$, which can be normalized to $\overline{M}_{\mathrm{norm}}$ by multiplication
with a constant. We show that both $\overline{M}$ and $\overline{M}_{\mathrm{norm}}$ are not limit computable.


Our results on the computability of Solomonoff induction are stated in Table 1
and proved in Section 3. In Section 4 we show that for finite horizons both the
entropy-seeking and the information-seeking agent are $\Delta^0_3$-computable and have
limit-computable $\varepsilon$-optimal policies. The weakly asymptotically optimal agent
BayesExp relies on optimal policies that are generally not limit computable [10,
Thm. 16]. In Section 5 we give a weakly asymptotically optimal agent based on
BayesExp that is limit computable. A list of notation is given at the end of the paper.













2 Preliminaries 


We use the setup and notation from [10].


2.1 The Arithmetical Hierarchy 

A set $A \subseteq \mathbb{N}$ is $\Sigma^0_n$ iff there is a computable relation $S$ such that

$$k \in A \iff \exists k_1 \forall k_2 \ldots Q_n k_n \; S(k, k_1, \ldots, k_n) \tag{1}$$

where $Q_n = \forall$ if $n$ is even, $Q_n = \exists$ if $n$ is odd [?, Def. 1.4.10]. A set $A \subseteq \mathbb{N}$ is
$\Pi^0_n$ iff its complement $\mathbb{N} \setminus A$ is $\Sigma^0_n$. We call the formula on the right hand side
of (1) a $\Sigma^0_n$-formula; its negation is called a $\Pi^0_n$-formula. It can be shown that
we can add any bounded quantifiers and duplicate quantifiers of the same type
without changing the classification of $A$. The set $A$ is $\Delta^0_n$ iff $A$ is $\Sigma^0_n$ and $A$ is
$\Pi^0_n$. We get $\Sigma^0_1$ as the class of recursively enumerable sets, $\Pi^0_1$ as the class
of co-recursively enumerable sets and $\Delta^0_1$ as the class of recursive sets.
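
To make the quantifier characterization concrete, the following minimal sketch semi-decides a $\Sigma^0_1$ set by searching for a witness $k_1$. The relation `is_nontrivial_divisor` and the `budget` parameter are illustrative assumptions, not part of the paper: the budget only exists so that the example terminates, whereas a genuine semi-decision procedure would keep searching forever on non-members.

```python
from typing import Callable, Optional

def semi_decide(k: int, S: Callable[[int, int], bool], budget: int) -> Optional[int]:
    """Search for a witness k1 < budget with S(k, k1); return it, or None if none was found yet."""
    for k1 in range(budget):
        if S(k, k1):
            return k1
    return None

# Toy computable relation (illustrative assumption): k1 is a nontrivial divisor of k.
def is_nontrivial_divisor(k: int, k1: int) -> bool:
    return 1 < k1 < k and k % k1 == 0

# A = {k | exists k1: S(k, k1)} is then the set of composite numbers.
print(semi_decide(15, is_nontrivial_divisor, budget=100))  # a witness is found
print(semi_decide(13, is_nontrivial_divisor, budget=100))  # no witness within the budget
```

The toy set is of course decidable; the point is only the shape of the search, which answers "yes" after finitely many steps on members and never commits to "no".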

We say the set $A \subseteq \mathbb{N}$ is $\Sigma^0_n$-hard ($\Pi^0_n$-hard, $\Delta^0_n$-hard) iff for any set $B \in \Sigma^0_n$
($B \in \Pi^0_n$, $B \in \Delta^0_n$), $B$ is many-one reducible to $A$, i.e., there is a computable
function $f$ such that $k \in B \iff f(k) \in A$ [?, Def. 1.2.1]. We get $\Sigma^0_n \subset \Delta^0_{n+1} \subset
\Sigma^0_{n+1} \subset \ldots$ and $\Pi^0_n \subset \Delta^0_{n+1} \subset \Pi^0_{n+1} \subset \ldots$. This hierarchy of subsets of natural
numbers is known as the arithmetical hierarchy.

By Post’s Theorem [?, Thm. 1.4.13], a set is $\Sigma^0_n$ if and only if it is recursively
enumerable on an oracle machine with an oracle for a $\Sigma^0_{n-1}$-hard set.


2.2 Strings 

Let $\mathcal{X}$ be some finite set called alphabet. The set $\mathcal{X}^* := \bigcup_{n=0}^{\infty} \mathcal{X}^n$ is the set of
all finite strings over the alphabet $\mathcal{X}$, the set $\mathcal{X}^\infty$ is the set of all infinite strings
over the alphabet $\mathcal{X}$, and the set $\mathcal{X}^\sharp := \mathcal{X}^* \cup \mathcal{X}^\infty$ is their union. The empty
string is denoted by $\epsilon$, not to be confused with the small positive real number $\varepsilon$.
Given a string $x \in \mathcal{X}^*$, we denote its length by $|x|$. For a (finite or infinite) string
$x$ of length $\ge k$, we denote with $x_{1:k}$ the first $k$ characters of $x$, and with $x_{<k}$
the first $k - 1$ characters of $x$. The notation $x_{1:\infty}$ stresses that $x$ is an infinite
string. We write $x \sqsubseteq y$ iff $x$ is a prefix of $y$, i.e., $x = y_{1:|x|}$.

2.3 Computability of Real-valued Functions 

We fix some encoding of rational numbers into binary strings and an encoding 
of binary strings into natural numbers. From now on, this encoding will be done 
implicitly wherever necessary. 

Definition 1 ($\Sigma^0_n$-, $\Pi^0_n$-, $\Delta^0_n$-computable). A function $f : \mathcal{X}^* \to \mathbb{R}$ is called
$\Sigma^0_n$-computable ($\Pi^0_n$-computable, $\Delta^0_n$-computable) iff the set $\{(x, q) \in \mathcal{X}^* \times \mathbb{Q} \mid
f(x) > q\}$ is $\Sigma^0_n$ ($\Pi^0_n$, $\Delta^0_n$).

Table 2. Connection between the computability of real-valued functions and the
arithmetical hierarchy.


A $\Delta^0_1$-computable function is called computable, a $\Sigma^0_1$-computable function is
called lower semicomputable, and a $\Pi^0_1$-computable function is called upper
semicomputable. A $\Delta^0_2$-computable function $f$ is called limit computable, because
there is a computable function $\phi$ such that

$$\lim_{k \to \infty} \phi(x, k) = f(x).$$


The program $\phi$ that limit computes $f$ can be thought of as an anytime algorithm
for $f$: we can stop $\phi$ at any time $k$ and get a preliminary answer. If the program
$\phi$ ran long enough (which we do not know), this preliminary answer will be close
to the correct one.
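
The anytime reading of limit computability can be illustrated with a small sketch. Here `phi(n, k)` is a hypothetical stand-in for $\phi$: it guesses whether a simple iterative process started at `n` ever reaches 1, based on only `k` steps of simulation. The Collatz iteration is chosen purely for illustration and is not taken from the paper.

```python
def phi(n: int, k: int) -> int:
    """Anytime guess after k steps: 1 if the Collatz trajectory of n >= 1 has reached 1, else 0."""
    for _ in range(k):
        if n == 1:
            return 1
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    return 1 if n == 1 else 0

# The guesses may change as k grows and eventually stabilize,
# but no finite k tells us in general that the final value has been reached.
print([phi(6, k) for k in (0, 4, 8, 16)])
```

Stopping at a small `k` yields a preliminary answer that may still be wrong; only the limit in `k` is guaranteed to be the value of the function being approximated.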

Limit-computable sets are the highest level in the arithmetical hierarchy that 
can be approached by a regular Turing machine. Above limit-computable sets 
we necessarily need some form of halting oracle. See Table 2 for the definition 
of lower/upper semicomputable and limit-computable functions in terms of the 
arithmetical hierarchy. 


Lemma 2 (Computability of Arithmetical Operations). Let $n > 0$ and
let $f, g : \mathcal{X}^* \to \mathbb{R}$ be two $\Delta^0_n$-computable functions. Then

(i) $\{(x,y) \mid f(x) > g(y)\}$ is $\Sigma^0_n$,

(ii) $\{(x,y) \mid f(x) \le g(y)\}$ is $\Pi^0_n$,

(iii) $f + g$, $f - g$, and $f \cdot g$ are $\Delta^0_n$-computable,

(iv) $f/g$ is $\Delta^0_n$-computable if $g(x) \ne 0$ for all $x$, and

(v) $\log f$ is $\Delta^0_n$-computable if $f(x) > 0$ for all $x$.


3 The Complexity of Solomonoff Induction 

A semimeasure over the alphabet $\mathcal{X}$ is a function $\nu : \mathcal{X}^* \to [0,1]$ such that
(i) $\nu(\epsilon) \le 1$, and (ii) $\nu(x) \ge \sum_{a \in \mathcal{X}} \nu(xa)$ for all $x \in \mathcal{X}^*$. A semimeasure is called
(probability) measure iff for all $x$ equalities hold in (i) and (ii).

Solomonoff’s prior $M$ [18] assigns to a string $x$ the probability that the
reference universal monotone Turing machine $U$ [11, Ch. 4.5.2] computes a string




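$$M(xy \mid x) > q \iff \forall \ell \, \exists k \;\; \frac{\phi(xy, k)}{\phi(x, \ell)} > q \iff \exists k \, \exists \ell_0 \, \forall \ell \ge \ell_0 \;\; \frac{\phi(xy, k)}{\phi(x, \ell)} > q$$

Fig. 1. A $\Pi^0_2$-formula and an equivalent $\Sigma^0_2$-formula defining conditional $M$. Here
$\phi(x,k)$ denotes a computable function that lower semicomputes $M(x)$.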


starting with $x$ when fed with uniformly random bits as input. Formally,

$$M(x) := \sum_{p:\; x \sqsubseteq U(p)} 2^{-|p|}. \tag{2}$$


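The function $M$ is a lower semicomputable semimeasure, but not computable
and not a measure [11, Lem. 4.5.3]. A semimeasure $\nu$ can be turned into a
measure $\nu_{\mathrm{norm}}$ using Solomonoff normalization: $\nu_{\mathrm{norm}}(\epsilon) := 1$ and for all $x \in \mathcal{X}^*$
and $a \in \mathcal{X}$,

$$\nu_{\mathrm{norm}}(xa) := \nu_{\mathrm{norm}}(x) \, \frac{\nu(xa)}{\sum_{b \in \mathcal{X}} \nu(xb)}. \tag{3}$$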


By definition, $M_{\mathrm{norm}}$ and $\overline{M}_{\mathrm{norm}}$ are measures [11, Sec. 4.5.3]. Moreover, since
$M_{\mathrm{norm}} \ge M$, normalization preserves universal dominance. Hence Solomonoff’s
theorem implies that $M_{\mathrm{norm}}$ predicts just as well as $M$.
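
As a minimal sketch of Solomonoff normalization, the code below normalizes a semimeasure over the binary alphabet by rescaling each one-step extension by the total mass of all one-step extensions, so that conditional ratios between symbols are preserved. The semimeasure `nu` is a toy chosen for illustration (it loses 20% of its mass at every step) and is assumed to be strictly positive so that the division is well-defined.

```python
from typing import Callable

ALPHABET = "01"

def solomonoff_normalize(nu: Callable[[str], float]) -> Callable[[str], float]:
    """Return nu_norm with nu_norm('') = 1 and nu_norm(xa) = nu_norm(x) * nu(xa) / sum_b nu(xb)."""
    def nu_norm(x: str) -> float:
        value = 1.0
        for i in range(len(x)):
            prefix = x[:i]
            total = sum(nu(prefix + b) for b in ALPHABET)  # mass of all one-step extensions
            value *= nu(prefix + x[i]) / total
        return value
    return nu_norm

# Toy semimeasure (illustrative assumption): each symbol keeps 40% of the parent's mass.
def nu(x: str) -> float:
    return 0.4 ** len(x)

nu_norm = solomonoff_normalize(nu)
print(nu("01"), nu_norm("01"))
```

For this toy input the normalized function is the uniform measure, and it dominates `nu` pointwise, matching the remark that normalization does not decrease probabilities.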

The measure mixture $\overline{M}$ [3, p. 74] is defined as

$$\overline{M}(x) := \lim_{n \to \infty} \sum_{y \in \mathcal{X}^n} M(xy). \tag{4}$$
The measure mixture $\overline{M}$ is the same as $M$ except that the contributions by
programs that do not produce infinite strings are removed: for any such program
$p$, let $k$ denote the length of the finite string generated by $p$. Then for $|xy| > k$,
the program $p$ does not contribute to $M(xy)$, hence it is excluded from $\overline{M}(x)$.

Similarly to $M$, the measure mixture $\overline{M}$ is not a (probability) measure since
$\overline{M}(\epsilon) < 1$, but in this case normalization (3) is just multiplication with the
constant $1/\overline{M}(\epsilon)$, leading to the normalized measure mixture $\overline{M}_{\mathrm{norm}}$. When
using the Solomonoff prior $M$ (or one of its sisters $M_{\mathrm{norm}}$, $\overline{M}$, or $\overline{M}_{\mathrm{norm}}$) for
sequence prediction, we need to compute the conditional probability $M(xy \mid
x) := M(xy)/M(x)$ for finite strings $x, y \in \mathcal{X}^*$. Because $M(x) > 0$ for all finite
strings $x \in \mathcal{X}^*$, this quotient is well-defined.

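Theorem 3 (Complexity of $M$, $M_{\mathrm{norm}}$, $\overline{M}$, and $\overline{M}_{\mathrm{norm}}$).

(i) $M(x)$ is lower semicomputable
(ii) $M(xy \mid x)$ is limit computable
(iii) $M_{\mathrm{norm}}(x)$ is limit computable
(iv) $M_{\mathrm{norm}}(xy \mid x)$ is limit computable
(v) $\overline{M}(x)$ is $\Pi^0_2$-computable
(vi) $\overline{M}(xy \mid x)$ is $\Delta^0_3$-computable
(vii) $\overline{M}_{\mathrm{norm}}(x)$ is $\Delta^0_3$-computable
(viii) $\overline{M}_{\mathrm{norm}}(xy \mid x)$ is $\Delta^0_3$-computable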





Proof. (i) By [11, Thm. 4.5.2]. Intuitively, we can run all programs in parallel
and get monotonely increasing lower bounds for $M(x)$ by adding $2^{-|p|}$ every
time a program $p$ has completed outputting $x$.

(ii) From (i) and Lemma 2 (iv), since $M(x) > 0$ (see also Figure 1).

(iii) By Lemma 2 (iii,iv) and $M(x) > 0$.

(iv) By (iii) and Lemma 2 (iv), since $M_{\mathrm{norm}}(x) \ge M(x) > 0$.

(v) Let $\phi$ be a computable function that lower semicomputes $M$. Since $M$ is a
semimeasure, $M(xy) \ge \sum_z M(xyz)$, hence $\sum_{y \in \mathcal{X}^n} M(xy)$ is nonincreasing
in $n$ and thus $\overline{M}(x) > q$ iff $\forall n \, \exists k \; \sum_{y \in \mathcal{X}^n} \phi(xy, k) > q$.

(vi) From (v) and Lemma 2 (iv), since $\overline{M}(x) > 0$.

(vii) From (v) and Lemma 2 (iv).

(viii) From (vi) and Lemma 2 (iv), since $\overline{M}_{\mathrm{norm}}(x) \ge \overline{M}(x) > 0$. □
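
The intuition behind part (i), running all programs in parallel and accumulating $2^{-|p|}$ whenever a program has printed $x$, can be sketched on a toy machine. Everything about this machine is an assumption made for illustration: its programs are the strings $1^n0$ (a prefix-free set), and program $1^n0$ prints the binary expansion of $n$ over and over. It is not universal, so the resulting mixture is only a stand-in for $M$.

```python
from fractions import Fraction

def toy_output(n: int, steps: int) -> str:
    """First `steps` symbols printed by toy program 1^n 0: the binary expansion of n, repeated."""
    pattern = bin(n)[2:]
    return (pattern * (steps // len(pattern) + 1))[:steps]

def phi(x: str, k: int) -> Fraction:
    """Lower bound on the toy mixture of x after dovetailing programs 0..k-1 for k steps each."""
    total = Fraction(0)
    for n in range(k):
        if toy_output(n, k).startswith(x):
            total += Fraction(1, 2 ** (n + 1))  # program 1^n 0 has length n + 1
    return total

print([phi("10", k) for k in (1, 2, 4, 8)])
```

The values are nondecreasing in `k` because outputs only grow and new programs only add mass, which is exactly the shape of a lower semicomputation: every value is a valid lower bound, but no finite stage certifies that the limit has been reached.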


We proceed to show that these bounds are in fact the best possible ones. If
$M$ were $\Delta^0_1$-computable, then so would be the conditional semimeasure $M(\,\cdot \mid \cdot\,)$.
Thus we could compute the $M$-adversarial sequence $z_1 z_2 \ldots$ defined by

$$z_t := \begin{cases} 0 & \text{if } M(1 \mid z_{<t}) > \tfrac{1}{2}, \\ 1 & \text{otherwise.} \end{cases}$$

The sequence $z_1 z_2 \ldots$ corresponds to a computable deterministic measure $\mu$.
However, we have $M(z_{1:t}) \le 2^{-t}$ by construction, so dominance $M(x) \ge w_\mu \mu(x)$
with $w_\mu > 0$ yields a contradiction with $t \to \infty$:

$$2^{-t} \ge M(z_{1:t}) \ge w_\mu \mu(z_{1:t}) = w_\mu > 0$$

By the same argument, the normalized Solomonoff prior $M_{\mathrm{norm}}$ cannot be
$\Delta^0_1$-computable. However, since it is a measure, $\Sigma^0_1$- or $\Pi^0_1$-computability would
entail $\Delta^0_1$-computability.
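
The adversarial construction works against any computable predictor, which the following sketch demonstrates. Since $M$ itself is not computable, the code uses Laplace's rule as a hypothetical stand-in (an assumption for illustration only); the point is that the sequence built against a predictor always receives probability at most $2^{-t}$ from that predictor.

```python
from fractions import Fraction
from typing import Callable

Predictor = Callable[[str, str], Fraction]  # (next symbol, history) -> conditional probability

def adversarial_sequence(cond: Predictor, length: int) -> str:
    """Build z with z_t = '0' if cond('1', z_<t) > 1/2 and z_t = '1' otherwise."""
    z = ""
    for _ in range(length):
        z += "0" if cond("1", z) > Fraction(1, 2) else "1"
    return z

# Stand-in predictor (illustrative assumption): Laplace's rule, a computable measure.
def laplace(symbol: str, history: str) -> Fraction:
    return Fraction(history.count(symbol) + 1, len(history) + 2)

def joint(cond: Predictor, x: str) -> Fraction:
    """Probability the predictor assigns to x, via the chain rule."""
    p = Fraction(1)
    for t in range(len(x)):
        p *= cond(x[t], x[:t])
    return p

z = adversarial_sequence(laplace, 10)
print(z, joint(laplace, z) <= Fraction(1, 2 ** len(z)))
```

Each chosen symbol has conditional probability at most one half, so the bound follows by the chain rule; exact rationals are used to avoid floating-point ties at the threshold.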

For $\overline{M}$ and $\overline{M}_{\mathrm{norm}}$ we prove the following two lower bounds for specific
universal Turing machines.

Theorem 4 ($\overline{M}$ is not Limit Computable). There is a universal Turing
machine $U'$ such that the set $\{(x,q) \mid \overline{M}_{U'}(x) > q\}$ is not in $\Delta^0_2$.

Proof. Assume the contrary, let $A$ be $\Pi^0_2$ but not $\Delta^0_2$, and let $S$ be a computable
relation such that

$$n \in A \iff \forall k \, \exists i \;\; S(n, k, i). \tag{5}$$

For each $n \in \mathbb{N}$, we define the program $p_n$ as follows.


output 1^{n+1}0
k := 0
while true:
    i := 0
    while not S(n, k, i):
        i := i + 1
    k := k + 1
    output 0



















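Each program $p_n$ always outputs $1^{n+1}0$. Furthermore, the program $p_n$ outputs
the infinite string $1^{n+1}0^\infty$ if and only if $n \in A$ by (5). We define $U'$ as follows
using our reference machine $U$.

- $U'(1^{n+1}0)$: Run $p_n$.

- $U'(00p)$: Run $U(p)$.

- $U'(01p)$: Run $U(p)$ and bitwise invert its output.

By construction, $U'$ is a universal Turing machine. No $p_n$ outputs a string
starting with $0^{n+1}1$, therefore $\overline{M}_{U'}(0^{n+1}1) = \tfrac{1}{4} \left( \overline{M}_U(0^{n+1}1) + \overline{M}_U(1^{n+1}0) \right)$. Hence

$$\overline{M}_{U'}(1^{n+1}0) = 2^{-n-2} \, \mathbb{1}_A(n) + \tfrac{1}{4} \overline{M}_U(1^{n+1}0) + \tfrac{1}{4} \overline{M}_U(0^{n+1}1).$$

If $n \notin A$, then $\overline{M}_{U'}(1^{n+1}0) = \overline{M}_{U'}(0^{n+1}1)$. Otherwise, we have
$|\overline{M}_{U'}(1^{n+1}0) - \overline{M}_{U'}(0^{n+1}1)| = 2^{-n-2}$.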

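Now we assume that $\overline{M}_{U'}$ is limit computable, i.e., there is a computable
function $\phi : \mathcal{X}^* \times \mathbb{N} \to \mathbb{Q}$ such that $\lim_{k \to \infty} \phi(x, k) = \overline{M}_{U'}(x)$. Since the
difference $\overline{M}_{U'}(1^{n+1}0) - \overline{M}_{U'}(0^{n+1}1)$ is either $0$ or $2^{-n-2}$, we get that

$$n \in A \iff \lim_{k \to \infty} \left( \phi(1^{n+1}0, k) - \phi(0^{n+1}1, k) \right) > 2^{-n-3},$$

thus $A$ is limit computable, a contradiction.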


□ 
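
The behavior of the programs $p_n$ used in the proof of Theorem 4 can be simulated with a step budget. In this sketch the relations `always` and `stuck_at_2` are toy assumptions standing in for $S$, and `budget` bounds the number of evaluations of the relation so that the simulation halts; the real $p_n$ runs unboundedly.

```python
from typing import Callable

Relation = Callable[[int, int, int], bool]

def run_p_n(n: int, S: Relation, budget: int) -> str:
    """Output printed by p_n after at most `budget` evaluations of the relation S."""
    out = "1" * (n + 1) + "0"
    k, i = 0, 0
    for _ in range(budget):
        if S(n, k, i):
            k, i = k + 1, 0   # witness found for this k: move on and print one more 0
            out += "0"
        else:
            i += 1            # keep searching for a witness i
    return out

# Toy relations (illustrative assumptions, not from the paper):
def always(n: int, k: int, i: int) -> bool:
    return i >= k             # every k has a witness, so the output grows forever

def stuck_at_2(n: int, k: int, i: int) -> bool:
    return k < 2              # k = 2 has no witness, so the output stays finite

print(len(run_p_n(1, always, 50)), len(run_p_n(1, always, 500)))
print(run_p_n(1, stuck_at_2, 50), run_p_n(1, stuck_at_2, 500))
```

With `always` the output keeps growing with the budget, which corresponds to an infinite output string; with `stuck_at_2` the inner search never terminates, so the output freezes no matter how large the budget is.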


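Corollary 5 ($\overline{M}_{\mathrm{norm}}$ is not $\Sigma^0_2$- or $\Pi^0_2$-computable). There is a universal
Turing machine $U'$ such that $\{(x, q) \mid \overline{M}_{\mathrm{norm}\,U'}(x) > q\}$ is not in $\Sigma^0_2$ or $\Pi^0_2$.

Proof. Since $\overline{M}_{\mathrm{norm}} = c \cdot \overline{M}$, there exists a $k \in \mathbb{N}$ such that $2^{-k} < c$ (even if we
do not know the value of $k$). We can show that the set $\{(x, q) \mid \overline{M}_{\mathrm{norm}\,U'}(x) > q\}$
is not in $\Delta^0_2$ analogously to the proof of Theorem 4 using

$$n \in A \iff \lim_{j \to \infty} \left( \phi(1^{n+1}0, j) - \phi(0^{n+1}1, j) \right) > 2^{-k-n-3}.$$

If $\overline{M}_{\mathrm{norm}}$ were $\Sigma^0_2$-computable or $\Pi^0_2$-computable, this would imply that $\overline{M}_{\mathrm{norm}}$
is $\Delta^0_2$-computable since $\overline{M}_{\mathrm{norm}}$ is a measure, a contradiction. □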

Since $M(\epsilon) = 1$, we have $M(x \mid \epsilon) = M(x)$, so the conditional probability
$M(xy \mid x)$ has at least the same complexity as $M$. Analogously for $M_{\mathrm{norm}}$ and
$\overline{M}_{\mathrm{norm}}$ since they are measures. For $\overline{M}$, we have that $\overline{M}(x \mid \epsilon) = \overline{M}_{\mathrm{norm}}(x)$, so
Corollary 5 applies. All that remains to prove is that conditional $M$ is not lower
semicomputable.

Theorem 6 (Conditional $M$ is not Lower Semicomputable). The set
$\{(x, xy, q) \mid M(xy \mid x) > q\}$ is not recursively enumerable.

Proof. Assume to the contrary that $M(xy \mid x)$ is lower semicomputable.
According to [?, Thm. 12] there is an infinite string $z_{1:\infty}$ such that $z_{2t} = z_{2t-1}$ for all
$t > 0$ and

$$\liminf_{t \to \infty} M(z_{1:2t} \mid z_{<2t}) < 1. \tag{6}$$

Define the semimeasure

$$\nu(x_{1:t}) := \begin{cases} \prod_{k=1}^{\lceil t/2 \rceil} M(x_{<2k} \mid x_{<2k-1}) & \text{if } x_{2k} = x_{2k-1} \;\; \forall\, 0 < 2k \le t, \\ 0 & \text{otherwise.} \end{cases}$$

Since we assume $M(x_{<2k} \mid x_{<2k-1})$ to be lower semicomputable, $\nu$ is lower
semicomputable. Therefore there is a constant $c > 0$ such that $M(x) \ge c\,\nu(x)$
for all $x \in \mathcal{X}^*$. With the chain rule we get for even-lengthed $x$ of length $t$ with $x_{2k} = x_{2k-1}$

$$c \le \frac{M(x)}{\nu(x)} = \frac{\prod_{i=1}^{t} M(x_{1:i} \mid x_{<i})}{\prod_{k=1}^{t/2} M(x_{<2k} \mid x_{<2k-1})} = \prod_{k=1}^{t/2} M(x_{1:2k} \mid x_{<2k}).$$

Plugging in the sequence $z_{1:\infty}$, we get a contradiction with (6):

$$0 < c \le \prod_{k=1}^{t} M(z_{1:2k} \mid z_{<2k}) \xrightarrow{\; t \to \infty \;} 0.$$

□


4 The Complexity of Knowledge-Seeking 

In general reinforcement learning the agent interacts with an environment in
cycles: at time step $t$ the agent chooses an action $a_t \in \mathcal{A}$ and receives a percept
$e_t = (o_t, r_t) \in \mathcal{E}$ consisting of an observation $o_t \in \mathcal{O}$ and a real-valued reward
$r_t \in \mathbb{R}$; the cycle then repeats for $t + 1$. A history is an element of $(\mathcal{A} \times \mathcal{E})^*$. We
use $æ \in \mathcal{A} \times \mathcal{E}$ to denote one interaction cycle, and $æ_{1:t}$ to denote a history of
length $t$. A policy is a function $\pi : (\mathcal{A} \times \mathcal{E})^* \to \mathcal{A}$ mapping each history to the
action taken after seeing this history. We assume $\mathcal{A}$ and $\mathcal{E}$ to be finite.

The environment can be stochastic, but is assumed to be semicomputable. In
accordance with the AIXI literature [5], we model environments as lower
semicomputable chronological conditional semimeasures (LSCCCSs). The class of
all LSCCCSs is denoted with $\mathcal{M}$. A conditional semimeasure $\nu$ takes a sequence
of actions $a_{1:\infty}$ as input and returns a semimeasure $\nu(\,\cdot \,\|\, a_{1:\infty})$ over $\mathcal{E}^\sharp$. A
conditional semimeasure $\nu$ is chronological iff percepts at time $t$ do not depend on
future actions, i.e., $\nu(e_{1:t} \,\|\, a_{1:k}) = \nu(e_{1:t} \,\|\, a_{1:t})$ for all $k > t$. Despite their
name, conditional semimeasures do not specify conditional probabilities; the
environment $\nu$ is not a joint probability distribution on actions and percepts. Here
we only care about the computability of the environment $\nu$; for our purposes,
chronological conditional semimeasures behave just like semimeasures.

Equivalently to (2), the Solomonoff prior $M$ can be defined as a mixture
over all lower semicomputable semimeasures using a lower semicomputable
universal prior [21]. We generalize this representation to chronological conditional
semimeasures: we fix the lower semicomputable universal prior $(w_\nu)_{\nu \in \mathcal{M}}$ with
$w_\nu > 0$ for all $\nu \in \mathcal{M}$ and $\sum_{\nu \in \mathcal{M}} w_\nu \le 1$, given by the reference machine $U$
according to $w_\nu := 2^{-K_U(\nu)}$ [5, Sec. 5.1.2]. The universal prior $w$ gives rise to
the universal mixture $\xi$, which is a convex combination of all LSCCCSs $\mathcal{M}$:

$$\xi(e_{<t} \,\|\, a_{<t}) := \sum_{\nu \in \mathcal{M}} w_\nu \, \nu(e_{<t} \,\|\, a_{<t})$$

The universal mixture $\xi$ is analogous to the Solomonoff prior $M$ but defined for
reactive environments. Analogously to Theorem 3 (i), the universal mixture $\xi$
is lower semicomputable [5, Sec. 5.10]. Moreover, we have $\xi_{\mathrm{norm}} \ge \xi$, preserving
universal dominance analogously to $M$.


4.1 Knowledge-Seeking Agents 

We discuss two variants of knowledge-seeking agents: entropy-seeking agents
(Shannon-KSA) [14,16] and information-seeking agents (KL-KSA) [17]. The
entropy-seeking agent maximizes the Shannon entropy gain, while the
information-seeking agent maximizes the expected Bayesian information gain (KL-divergence)
in the universal mixture $\xi$. These quantities are expressed in the value function.

In this section we use a finite lifetime $m$ (possibly dependent on time step $t$):
the knowledge-seeking agent maximizes entropy/information received up to and
including time step $m$. We assume that the function $m$ (of $t$) is computable.

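Definition 7 (Entropy-Seeking Value Function [16, Sec. 6]). The
entropy-seeking value of a policy $\pi$ given history $æ_{<t}$ is

$$V^\pi_{\mathrm{Ent}}(æ_{<t}) := \sum_{e_{t:m}} -\xi_{\mathrm{norm}}(e_{1:m} \mid e_{<t} \,\|\, a_{1:m}) \log_2 \xi_{\mathrm{norm}}(e_{1:m} \mid e_{<t} \,\|\, a_{1:m})$$

where $a_i := \pi(æ_{<i})$ for all $i \ge t$.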


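Definition 8 (Information-Seeking Value Function [17, Def. 1]). The
information-seeking value of a policy $\pi$ given history $æ_{<t}$ is

$$V^\pi_{\mathrm{IG}}(æ_{<t}) := \sum_{e_{t:m}} \sum_{\nu \in \mathcal{M}} w_\nu \, \frac{\nu(e_{1:m} \,\|\, a_{1:m})}{\xi_{\mathrm{norm}}(e_{<t} \,\|\, a_{<t})} \log_2 \frac{\nu(e_{1:m} \mid e_{<t} \,\|\, a_{1:m})}{\xi_{\mathrm{norm}}(e_{1:m} \mid e_{<t} \,\|\, a_{1:m})}$$

where $a_i := \pi(æ_{<i})$ for all $i \ge t$.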

We use $V^\pi$ in places where either of the entropy-seeking or the
information-seeking value function can be substituted.

Definition 9 (($\varepsilon$-)Optimal Policy). The optimal value function $V^*$ is defined
as $V^*(æ_{<t}) := \sup_\pi V^\pi(æ_{<t})$. A policy $\pi$ is optimal iff $V^\pi(æ_{<t}) = V^*(æ_{<t})$ for
all histories $æ_{<t} \in (\mathcal{A} \times \mathcal{E})^*$. A policy $\pi$ is $\varepsilon$-optimal iff $V^*(æ_{<t}) - V^\pi(æ_{<t}) < \varepsilon$
for all histories $æ_{<t} \in (\mathcal{A} \times \mathcal{E})^*$.

An entropy-seeking agent is defined as an optimal policy for the value
function $V^*_{\mathrm{Ent}}$ and an information-seeking agent is defined as an optimal policy for the
value function $V^*_{\mathrm{IG}}$.

The entropy-seeking agent does not work well in stochastic environments
because it gets distracted by noise in the environment rather than trying to
distinguish environments [17]. Moreover, the unnormalized knowledge-seeking
agents may fail to seek knowledge in deterministic semimeasures as the following
example demonstrates.









Example 10 (Unnormalized Entropy-Seeking). Suppose we use $\xi$ instead of
$\xi_{\mathrm{norm}}$ in Definition 7. Fix $\mathcal{A} := \{\alpha, \beta\}$, $\mathcal{E} := \{0, 1\}$, and $m := 1$ (we only care about
the entropy of the next percept). We illustrate the problem on a simple class of
environments $\{\nu_1, \nu_2\}$:

$\nu_1$: $\alpha/0/0.1$, $\beta/0/0.5$

$\nu_2$: $\alpha/1/0.1$, $\beta/0/0.5$

where transitions are labeled with action/percept/probability. Both $\nu_1$ and $\nu_2$
return a percept deterministically or nothing at all (the environment ends).
Only action $\alpha$ distinguishes between the environments. With the prior $w_{\nu_1} :=
w_{\nu_2} := 1/2$, we get a mixture $\xi$ for the entropy-seeking value function $V^\pi_{\mathrm{Ent}}$. Then
$V^*_{\mathrm{Ent}}(\alpha) \approx 0.432 < 0.5 = V^*_{\mathrm{Ent}}(\beta)$, hence action $\beta$ is preferred over $\alpha$ by the
entropy-seeking agent. But taking action $\beta$ yields percept 0 (if any), hence nothing is
learned about the environment. ◊
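
The two numbers in Example 10 can be reproduced directly. The sketch below encodes the two environments as dictionaries (the names `nu1`, `nu2`, `"alpha"`, `"beta"` are labels introduced here for illustration) and evaluates the entropy of the next percept under the unnormalized mixture with prior weight 1/2 on each environment.

```python
from math import log2

# Percept probabilities nu(e | action) of the two environments in Example 10;
# the missing mass is the probability that the environment ends.
nu1 = {"alpha": {0: 0.1}, "beta": {0: 0.5}}
nu2 = {"alpha": {1: 0.1}, "beta": {0: 0.5}}
prior = 0.5  # prior weight of each environment

def xi(action: str, percept: int) -> float:
    """Unnormalized mixture over the two environments."""
    return prior * nu1[action].get(percept, 0.0) + prior * nu2[action].get(percept, 0.0)

def entropy_value(action: str) -> float:
    """Unnormalized entropy-seeking value of taking `action` when only the next percept counts."""
    return -sum(xi(action, e) * log2(xi(action, e)) for e in (0, 1) if xi(action, e) > 0)

print(round(entropy_value("alpha"), 3), round(entropy_value("beta"), 3))
```

Action $\beta$ scores higher only because the mixture assigns probability 0.5 to the percept and 0.5 to the environment ending, which the unnormalized value counts as entropy even though both environments behave identically under $\beta$.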


Solomonoff’s prior is extremely good at learning: with this prior a Bayesian
agent learns the value of its own policy asymptotically (on-policy value
convergence) [?, Thm. 5.36]. However, generally it does not learn the result of
counterfactual actions that it does not take. Knowledge-seeking agents learn
the environment more effectively, because they focus on exploration. Both the
entropy-seeking agent and the information-seeking agent are strongly
asymptotically optimal in the class of all deterministic computable environments [?,
Thm. 5]: the value of their policy converges to the optimal value in the sense that
$V^\pi \to V^*$ almost surely. Moreover, the information-seeking agent also learns to
predict the result of counterfactual actions [?, Thm. 7].


4.2 Knowledge-Seeking is Limit Computable 

We proceed to show that $\varepsilon$-optimal knowledge-seeking agents are limit
computable, and optimal knowledge-seeking agents are in $\Delta^0_3$.

Theorem 11 (Computability of Knowledge-Seeking). There are
limit-computable $\varepsilon$-optimal policies and $\Delta^0_3$-computable optimal policies for
entropy-seeking and information-seeking agents.

Proof. Since $\xi$, $\nu$, and $w$ are lower semicomputable, the value functions $V^*_{\mathrm{Ent}}$ and
$V^*_{\mathrm{IG}}$ are $\Delta^0_2$-computable according to Lemma 2 (iii-v). The claim now follows from
the following lemma. □

Lemma 12 (Complexity of ($\varepsilon$-)Optimal Policies [10, Thm. 8 & 11]). If
the optimal value function $V^*$ is $\Delta^0_n$-computable, then there is an optimal policy
$\pi^*$ that is in $\Delta^0_{n+1}$, and there is an $\varepsilon$-optimal policy $\pi^\varepsilon$ that is in $\Delta^0_n$.


5 A Weakly Asymptotically Optimal Agent in $\Delta^0_2$

In reinforcement learning we are interested in reward-seeking policies. Rewards
are provided by the environment as part of each percept $e_t = (o_t, r_t)$ where
$o_t \in \mathcal{O}$ is the observation and $r_t \in [0,1]$ is the reward. In this section we fix a
computable discount function $\gamma : \mathbb{N} \to \mathbb{R}$ with $\gamma(t) \ge 0$ and $\sum_{t=1}^{\infty} \gamma(t) < \infty$.

The discount normalization factor is defined as $\Gamma_t := \sum_{k=t}^{\infty} \gamma(k)$. The effective
horizon $H_t(\varepsilon)$ is a horizon that is long enough to encompass all but an $\varepsilon$ of the
discount function’s mass:

$$H_t(\varepsilon) := \min\{k \mid \Gamma_{t+k}/\Gamma_t \le \varepsilon\}.$$
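
As a small worked example of the effective horizon, the sketch below computes it by direct search for geometric discounting, where the tail sum has a closed form. The function names and the choice of discount base 0.5 are assumptions for illustration; the search assumes that the tail sum at $t$ is positive and that tail sums tend to zero.

```python
from typing import Callable

def effective_horizon(Gamma: Callable[[int], float], t: int, eps: float) -> int:
    """Smallest k with Gamma(t + k) / Gamma(t) <= eps, assuming Gamma(t) > 0 and Gamma -> 0."""
    k = 0
    while Gamma(t + k) / Gamma(t) > eps:
        k += 1
    return k

# Geometric discounting gamma(k) = g**k has the closed-form tail Gamma_t = g**t / (1 - g).
g = 0.5

def Gamma(t: int) -> float:
    return g ** t / (1 - g)

print([effective_horizon(Gamma, t, 0.1) for t in (0, 5, 20)])
```

For geometric discounting the ratio of tail sums depends only on $k$, so the effective horizon is the same at every time step; this is the sublinear growth that the later discussion of weak asymptotic optimality relies on.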

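Definition 13 (Reward-Seeking Value Function [10, Def. 20]). The
reward-seeking value of a policy $\pi$ in environment $\nu$ given history $æ_{<t}$ is

$$V^\pi_\nu(æ_{<t}) := \frac{1}{\Gamma_t} \sum_{m=t}^{\infty} \sum_{e_{t:m}} \gamma(m) \, r_m \, \nu(e_{1:m} \mid e_{<t} \,\|\, a_{1:m})$$

if $\Gamma_t > 0$ and $V^\pi_\nu(æ_{<t}) := 0$ if $\Gamma_t = 0$ where $a_i := \pi(æ_{<i})$ for all $i \ge t$.

Definition 14 (Weak Asymptotic Optimality [3, Def. 7]). A policy $\pi$ is
weakly asymptotically optimal in the class of environments $\mathcal{M}$ iff the
reward-seeking value converges to the optimal value on-policy in Cesàro mean, i.e.,

$$\frac{1}{t} \sum_{k=1}^{t} \left( V^*_\nu(æ_{<k}) - V^\pi_\nu(æ_{<k}) \right) \to 0 \quad \text{as } t \to \infty$$

$\nu$-almost surely for all $\nu \in \mathcal{M}$.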


Not all discount functions admit weakly asymptotically optimal policies [3, Thm.
8]; a necessary condition is that the effective horizon grows sublinearly [?, Thm.
5.5]. This is satisfied by geometric discounting, but not by harmonic or power
discounting [3, Tab. 5.41].

This condition is also sufficient [3, Thm. 5.6]: Lattimore defines a weakly
asymptotically optimal agent called BayesExp [?, Ch. 5]. BayesExp alternates
between phases of exploration and phases of exploitation: if the optimal
information-seeking value is larger than $\varepsilon_t$, then BayesExp starts an exploration
phase, otherwise it starts an exploitation phase. During an exploration phase,
BayesExp follows an optimal information-seeking policy for $H_t(\varepsilon_t)$ steps. During
an exploitation phase, BayesExp follows a $\xi$-optimal reward-seeking policy for
one step [?, Alg. 2].

Generally, optimal reward-seeking policies are $\Pi^0_2$-hard [10, Thm. 16], and for
optimal knowledge-seeking policies we only proved that they are $\Delta^0_3$. Therefore
we do not know BayesExp to be limit computable, and we expect it not to
be. However, we can approximate it using $\varepsilon$-optimal policies preserving weak
asymptotic optimality.


Theorem 15 (A Limit-Computable Weakly Asymptotically Optimal
Agent). If there is a nonincreasing computable sequence of positive reals $(\varepsilon_t)_{t \in \mathbb{N}}$
such that $\varepsilon_t \to 0$ and $H_t(\varepsilon_t)/(t \varepsilon_t) \to 0$ as $t \to \infty$, then there is a
limit-computable policy that is weakly asymptotically optimal in the class of all
computable stochastic environments.



Proof. Analogously to Theorem 3 (i) we get that $\xi$ is lower semicomputable,
and hence the optimal reward-seeking value function $V^*_\xi$ is limit computable [10,
Lem. 21]. Hence by Lemma 12 there is a limit-computable $2^{-t}$-optimal
reward-seeking policy $\pi_\xi$ for the universal mixture $\xi$ [10, Cor. 22]. By Theorem 11 there
are limit-computable $\varepsilon_t/2$-optimal information-seeking policies $\pi_{\mathrm{IG}}$ with lifetime
$t + H_t(\varepsilon_t)$. We define a policy $\pi$ analogously to BayesExp with $\pi_\xi$ and $\pi_{\mathrm{IG}}$ instead
of the optimal policies:

If $V^*_{\mathrm{IG}}(æ_{<t}) > \varepsilon_t$ for lifetime $t + H_t(\varepsilon_t)$, then follow $\pi_{\mathrm{IG}}$ for $H_t(\varepsilon_t)$ steps.
Otherwise, follow $\pi_\xi$ for one step.

Since $V^*_{\mathrm{IG}}$, $\pi_{\mathrm{IG}}$, and $\pi_\xi$ are limit computable, the policy $\pi$ is limit computable.
Furthermore, $\pi_\xi$ is $2^{-t}$-optimal and $2^{-t} \to 0$, so $V^{\pi_\xi}_\xi(æ_{<t}) \to V^*_\xi(æ_{<t})$ as $t \to \infty$.

Now we can proceed analogously to the proof of [6, Thm. 5.6], which consists
of three parts. First, it is shown that the value of the $\xi$-optimal reward-seeking
policy $\pi^*_\xi$ converges to the optimal value for exploitation time steps (second
branch in the definition of $\pi$) in the sense that $V^{\pi^*_\xi}_\mu \to V^*_\mu$. This carries over to
the $2^{-t}$-optimal policy $\pi_\xi$, since the key property is that on exploitation steps,
$V^*_{\mathrm{IG}} < \varepsilon_t$; i.e., $\pi$ only exploits if potential knowledge-seeking value is low. In short,
we get for exploitation steps

$$V^{\pi_\xi}_\xi(æ_{<t}) \to V^*_\xi(æ_{<t}) \to V^{\pi^*_\xi}_\mu(æ_{<t}) \to V^*_\mu(æ_{<t}) \quad \text{as } t \to \infty.$$

Second, it is shown that the density of exploration steps vanishes. This result
carries over since the condition $V^*_{\mathrm{IG}}(æ_{<t}) > \varepsilon_t$ that determines exploration steps
is exactly the same as for BayesExp and $\pi_{\mathrm{IG}}$ is $\varepsilon_t/2$-optimal.

Third, the results of part one and two are used to conclude that $\pi$ is weakly
asymptotically optimal. This part carries over to our proof. □
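
The two-branch rule that defines the policy in the proof can be sketched as a scheduling loop. This is only a control-flow illustration under strong simplifying assumptions: the callables `info_value`, `eps` and `horizon` are hypothetical stand-ins that depend on the time step alone, whereas in the proof the information-seeking value depends on the whole history and is only limit computable.

```python
from typing import Callable, List

def schedule(info_value: Callable[[int], float],
             eps: Callable[[int], float],
             horizon: Callable[[int], int],
             steps: int) -> List[str]:
    """Label each time step 'explore' or 'exploit' following the two-branch rule."""
    labels: List[str] = []
    t = 1
    while t <= steps:
        if info_value(t) > eps(t):
            phase = max(1, horizon(t))        # follow the information-seeking policy for a whole phase
            labels += ["explore"] * phase
            t += phase
        else:
            labels.append("exploit")          # follow the reward-seeking policy for one step
            t += 1
    return labels[:steps]

# Toy stand-ins (illustrative assumptions): decaying information value, slowly shrinking threshold.
def toy_info_value(t: int) -> float:
    return 2.0 / t

def toy_eps(t: int) -> float:
    return t ** -0.5

def toy_horizon(t: int) -> int:
    return 2

print(schedule(toy_info_value, toy_eps, toy_horizon, 8))
```

In this toy run the exploration phases stop once the information value falls below the threshold, mirroring the second part of the proof, where the density of exploration steps vanishes.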


6 Summary 


When using Solomonoff’s prior for induction, we need to evaluate conditional
probabilities. We showed that conditional $M$ and $M_{\mathrm{norm}}$ are limit computable
(Theorem 3), and that $\overline{M}$ and $\overline{M}_{\mathrm{norm}}$ are not limit computable (Theorem 4 and
Corollary 5; see Table 1). This result implies that we can approximate
$M$ or $M_{\mathrm{norm}}$ for prediction, but not the measure mixture $\overline{M}$ or $\overline{M}_{\mathrm{norm}}$.

In some cases, normalized priors have advantages. As illustrated in
Example 10, unnormalized priors can make the entropy-seeking agent mistake the
entropy gained from the probability assigned to finite strings for knowledge. From
$M_{\mathrm{norm}} \ge M$ we get that $M_{\mathrm{norm}}$ predicts just as well as $M$, and by Theorem 3
we can use $M_{\mathrm{norm}}$ without losing limit computability.

Any method that tries to tackle the reinforcement learning problem has to
balance between exploration and exploitation. AIXI strikes this balance in the
Bayesian way. However, this does not lead to enough exploration. Our
agent cares more about the present than the future; hence an investment in the
form of exploration is discouraged. To counteract this, we can add a
knowledge-seeking component to the agent. In Section 4 we discussed two variants of
knowledge-seeking agents: entropy-seekers [16] and information-seekers [17]. We
showed that $\varepsilon$-optimal knowledge-seeking agents are limit computable and
optimal knowledge-seeking agents are $\Delta^0_3$ (Theorem 11).


We set out with the goal of finding a perfect reinforcement learning agent
that is limit computable. Weakly asymptotically optimal agents can be
considered a suitable candidate, since they are currently the only known general
reinforcement learning agents which are optimal in an objective sense [5]. We
discussed Lattimore’s BayesExp [?, Ch. 5], which relies on Solomonoff induction
to learn its environment and on a knowledge-seeking component for extra
exploration. Our results culminated in a limit-computable weakly asymptotically
optimal agent (Theorem 15) based on Lattimore’s BayesExp. In this sense our
goal has been achieved.
List of Notation

$:=$                       defined to be equal
$\mathbb{N}$               the natural numbers, starting with 0
$A$, $B$                   sets of natural numbers
$\mathbb{1}_A$             the characteristic function that is 1 if its argument is an element of the set $A$ and 0 otherwise
$\mathcal{X}^*$            the set of all finite strings over the alphabet $\mathcal{X}$
$\mathcal{X}^\infty$       the set of all infinite strings over the alphabet $\mathcal{X}$
$\mathcal{X}^\sharp$       $\mathcal{X}^\sharp := \mathcal{X}^* \cup \mathcal{X}^\infty$, the set of all finite and infinite strings over the alphabet $\mathcal{X}$
$x$, $y$                   finite or infinite strings, $x, y \in \mathcal{X}^\sharp$
$x \sqsubseteq y$          the string $x$ is a prefix of the string $y$
$\epsilon$                 the empty string, the history of length 0
$\varepsilon$              a small positive real number
$\mathcal{A}$              the (finite) set of possible actions
$\mathcal{O}$              the (finite) set of possible observations
$\mathcal{E}$              the (finite) set of possible percepts, $\mathcal{E} \subset \mathcal{O} \times \mathbb{R}$
$M$                        Solomonoff’s prior defined in (2)
$\overline{M}$             the measure mixture defined in (4)
$\nu_{\mathrm{norm}}$      Solomonoff normalization of the semimeasure $\nu$ defined in (3)
$\alpha$, $\beta$          two different actions, $\alpha, \beta \in \mathcal{A}$
$a_t$                      the action in time step $t$
$e_t$                      the percept in time step $t$
$o_t$                      the observation in time step $t$
$r_t$                      the reward in time step $t$, bounded between 0 and 1
$æ_{<t}$                   the first $t - 1$ interactions, $a_1 e_1 a_2 e_2 \ldots a_{t-1} e_{t-1}$
$\gamma$                   the discount function $\gamma : \mathbb{N} \to \mathbb{R}_{\ge 0}$
$\Gamma_t$                 a discount normalization factor, $\Gamma_t := \sum_{i=t}^{\infty} \gamma(i)$
$H_t(\varepsilon)$         the effective horizon, $H_t(\varepsilon) = \min\{n \mid \Gamma_{t+n}/\Gamma_t \le \varepsilon\}$
$\pi$                      a policy, i.e., a function $\pi : (\mathcal{A} \times \mathcal{E})^* \to \mathcal{A}$
$V^\pi_{\mathrm{Ent}}$     the entropy-seeking value of the policy $\pi$ (see Definition 7)
$V^\pi_{\mathrm{IG}}$      the information-seeking value of the policy $\pi$ (see Definition 8)
$V^\pi_\nu$                the reward-seeking value of policy $\pi$ in environment $\nu$ (see Definition 13)
$V^\pi$                    the entropy-seeking/information-seeking/reward-seeking value of policy $\pi$
$V^*$                      the optimal entropy-seeking/information-seeking/reward-seeking value
$\phi$                     a computable function
$S$                        a computable relation over natural numbers
$n$, $k$, $i$              natural numbers
$t$                        (current) time step
$m$                        lifetime of the agent (a function of the current time step $t$)
$\mathcal{M}$              the class of all lower semicomputable chronological conditional semimeasures; our environment class
$\nu$                      lower semicomputable semimeasure
$\mu$                      computable measure, the true environment
$\xi$                      the universal mixture over all environments in $\mathcal{M}$

Open Questions 


1. Can the upper bound of $\Delta^0_3$ for knowledge-seeking policies be improved?

2. Is BayesExp limit computable?

3. Does the lower bound given in Theorem 4 and Corollary 5 hold for any universal
Turing machine?

We expect the answers to questions 1 and 2 to be negative and the answer to
question 3 to be positive.


















